BinaryBERT: Pushing the Limit of BERT Quantization

TABLE 5.5  Quantization results of BinaryBERT on SQuAD and MNLI-m. #Bits is listed as the bit-width of Transformer weights, word embeddings, and activations (W-E-A); model sizes are in MB.

| Method       | #Bits      | Size (MB) | SQuAD-v1.1 | MNLI-m |
|--------------|------------|-----------|------------|--------|
| BERT-base    | full-prec. | 418       | 80.8/88.5  | 84.6   |
| DistilBERT   | full-prec. | 250       | 79.1/86.9  | 81.6   |
| LayerDrop-6L | full-prec. | 328       | -          | 82.9   |
| LayerDrop-3L | full-prec. | 224       | -          | 78.6   |
| TinyBERT-6L  | full-prec. | 55        | 79.7/87.5  | 82.8   |
| ALBERT-E128  | full-prec. | 45        | 82.3/89.3  | 81.6   |
| ALBERT-E768  | full-prec. | 120       | 81.5/88.6  | 82.0   |
| Quant-Noise  | PQ         | 38        | -          | 83.6   |
| Q-BERT       | 2/4-8-8    | 53        | 79.9/87.5  | 83.5   |
| Q-BERT       | 2/3-8-8    | 46        | 79.3/87.0  | 81.8   |
| Q-BERT       | 2-8-8      | 28        | 69.7/79.6  | 76.6   |
| GOBO         | 3-4-32     | 43        | -          | 83.7   |
| GOBO         | 2-2-32     | 28        | -          | 71.0   |
| TernaryBERT  | 2-2-8      | 28        | 79.9/87.4  | 83.5   |
| BinaryBERT   | 1-1-8      | 17        | 80.8/88.3  | 84.2   |
| BinaryBERT   | 1-1-4      | 17        | 79.3/87.2  | 83.9   |

Then, the prediction-layer distillation minimizes the soft cross-entropy (SCE) between the quantized student logits $\hat{y}$ and the teacher logits $y$, i.e.,

$$\ell_{\mathrm{pred}} = \mathrm{SCE}(\hat{y}, y). \tag{5.25}$$
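For concreteness, the SCE term can be written in a few lines of PyTorch. The sketch below is illustrative rather than taken from the paper's code; the function name and the optional temperature argument are assumptions, with a temperature of 1 matching Eq. (5.25).

```python
import torch.nn.functional as F

def soft_cross_entropy(student_logits, teacher_logits, temperature=1.0):
    """Soft cross-entropy between quantized-student and teacher logits.

    Illustrative sketch of Eq. (5.25); the temperature argument is an added
    convenience, with temperature=1 corresponding to the equation as stated.
    """
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # Cross-entropy of the teacher distribution against the student's
    # log-probabilities, averaged over the batch.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean()
```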

After splitting from the half-sized ternary model, the binary model inherits the ternary model's performance, but now on a full-width architecture. However, the original minimum of the ternary model may no longer be a minimum in the new loss landscape after splitting. Thus, the authors further proposed to fine-tune the binary model with prediction-layer distillation to search for a better solution.
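The property that makes this inheritance possible is that the two binary halves sum back to the ternary weight, so the full-width model computes the same function at initialization. The sketch below illustrates only this sum-preserving idea on a single weight tensor; the function name and the rule for zero-valued weights are simplified assumptions, not the exact ternary weight splitting construction used in the paper.

```python
import torch

def split_ternary(w_ternary, alpha):
    """Split a ternary tensor with values in {-alpha, 0, +alpha} into two
    binary tensors a, b with values in {-alpha/2, +alpha/2} such that
    a + b == w_ternary.  Simplified illustration, not the paper's exact rule.
    """
    half = alpha / 2.0
    sign = torch.sign(w_ternary)  # -1, 0, or +1 per entry
    # Non-zero entries are split evenly; zeros become a cancelling +/- pair.
    a = torch.where(sign == 0, torch.full_like(w_ternary, half), sign * half)
    b = torch.where(sign == 0, torch.full_like(w_ternary, -half), sign * half)
    assert torch.allclose(a + b, w_ternary)
    return a, b
```

Because the two halves reconstruct the ternary weights exactly, the widened binary model starts from the very solution the ternary model converged to, and fine-tuning only has to refine it in the new loss landscape.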

For implementation, the authors took DynaBERT [89] sub-networks as backbones, which offer both half-sized and full-sized models for easy comparison. First, a ternary model of width 0.5× is trained to convergence with the two-stage knowledge distillation. The authors then split it into a binary model of width 1.0× and performed further fine-tuning with prediction-layer distillation.
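The final fine-tuning stage optimizes only the prediction-layer distillation loss. The loop below is a hedged sketch of what that stage might look like: the model and data-loader interfaces, optimizer choice, and hyperparameters are placeholders rather than the paper's settings, quantization-aware details (straight-through gradient estimation, re-binarization after each update) are omitted, and `soft_cross_entropy` refers to the helper sketched after Eq. (5.25).

```python
import torch

def finetune_with_pred_distillation(binary_student, teacher, loader,
                                    epochs=1, lr=2e-5):
    """Fine-tune the split, full-width binary model with prediction-layer
    distillation only.  Both models are assumed to return task logits for a
    dict-style batch; quantization-aware details are omitted.
    """
    optimizer = torch.optim.AdamW(binary_student.parameters(), lr=lr)
    teacher.eval()
    binary_student.train()
    for _ in range(epochs):
        for batch in loader:
            with torch.no_grad():
                teacher_logits = teacher(**batch)
            student_logits = binary_student(**batch)
            # Prediction-layer distillation loss, Eq. (5.25).
            loss = soft_cross_entropy(student_logits, teacher_logits)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return binary_student
```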

Table 5.5 compares the proposed BinaryBERT with a variety of state-of-the-art counterparts for quantizing BERT on MNLI of GLUE [230] and SQuAD [198], including Q-BERT [208], GOBO [279], Quant-Noise [65], and TernaryBERT [285]. Aside from quantization, other general compression approaches are also compared, such as DistilBERT [206], LayerDrop [64], TinyBERT [106], and ALBERT [126]. BinaryBERT has the smallest model size and the best performance among all quantization approaches. Compared with the full-precision model, BinaryBERT retains competitive performance with significantly reduced model size and computation. For example, it achieves a compression ratio of more than 24× over BERT-base (418 MB vs. 17 MB), with only a 0.4% drop on MNLI-m and a 0.0%/0.2% (EM/F1) drop on SQuAD v1.1.

In summary, the paper's contributions are twofold: (1) it is the first work to explore BERT binarization, together with an analysis of the performance drop of binarized BERT models; and (2) it proposes a ternary weight-splitting method that splits a trained ternary BERT to initialize BinaryBERT, followed by fine-tuning for further refinement.